Motivation

• Given: n samples in d-dimensional space

$X = (x_1, \dots, x_n) \in \mathbb{R}^{d \times n}$

• Decrease d ⇒ dimensionality reduction:

- PCA
- MDS
Principal Component Analysis

• Idea: Compute an orthogonal linear transformation
that transforms the data into a new coordinate
system s.t.

- greatest variance on first coordinate axis
- second greatest variance on second axis
- etc.

• Optimal transform for a given data set in the least
squares sense

• Dimensionality reduction: project data into lower
dimensional space spanned by first principal
components
Principal Component Analysis

Given n samples scattered in d-dimensional space,
written as a matrix

$X = (x_1, x_2, \dots, x_n) \in \mathbb{R}^{d \times n}$

compute the centered covariance matrix:

$C = (X - \bar{X})(X - \bar{X})^T \in \mathbb{R}^{d \times d}$

(interpretation as map from $\mathbb{R}^d$ to $\mathbb{R}^d$)
Principal Component Analysis

computation of C with the “centering matrix”:

$C = (XJ)(XJ)^T = X J J^T X^T, \qquad J = I_n - \frac{1}{n} \mathbf{1}\mathbf{1}^T$

principal component(s):

eigenvector(s) $v_i$ to largest eigenvalue(s) $\lambda_i$ of C

(low rank approximation)

B Visual Computing Institute | Prof. Dr. Leif Kobbelt 
Computer Graphics and Multimedia 
Data Analysis and Visualization 




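Principal Component Analysis

$C = V D V^T$

$= [v_1 \dots v_d] \, \mathrm{diag}[\lambda_1 \dots \lambda_d] \, [v_1 \dots v_d]^T$
$\approx [v_1 \dots v_m] \, \mathrm{diag}[\lambda_1 \dots \lambda_m] \, [v_1 \dots v_m]^T$

$X^* := [v_1 \dots v_m]^T X J \in \mathbb{R}^{m \times n}$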
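The PCA steps on the preceding slides (centering, covariance, eigenvectors, projection) can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's reference code: the function name `pca` and the toy data are assumptions, and samples are stored as columns to match the d x n convention used here.

```python
import numpy as np

def pca(X, m):
    """Project the columns of X (shape d x n) onto the first m principal components."""
    d, n = X.shape
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix J = I - (1/n) 1 1^T
    XJ = X @ J                               # subtracts the sample mean from every column
    C = XJ @ XJ.T                            # d x d (unnormalized) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:m]    # indices of the m largest eigenvalues
    V_m = eigvecs[:, order]                  # d x m matrix [v_1 ... v_m]
    return V_m.T @ XJ, V_m, eigvals[order]   # reduced data X* has shape m x n

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 points in R^3 that vary mostly along (3, 1, 0).
t = rng.normal(size=200)
X = np.vstack([3 * t, t, 0.1 * rng.normal(size=200)])
X_star, V_m, lam = pca(X, 2)
```

For large n it would be cheaper to subtract the column mean directly instead of forming the n x n matrix J; the explicit J is kept here only to mirror the slide.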
Relation to SVD

• singular value decomposition

$XJ = V \Sigma U^T$

$C = XJ \, (XJ)^T = V \Sigma U^T U \Sigma^T V^T = V \Sigma^2 V^T$
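The relation between the SVD of the centered data and the eigen-decomposition of C can be checked numerically. The sketch below uses random data as a stand-in (an assumption); `np.linalg.svd` returns the factors in the order left vectors, singular values, transposed right vectors, which correspond to V, Σ and U^T in the notation of this slide.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 50))                 # hypothetical data, d = 4, n = 50
XJ = X - X.mean(axis=1, keepdims=True)       # same result as multiplying by the centering matrix J

V, sigma, Ut = np.linalg.svd(XJ, full_matrices=False)   # XJ = V diag(sigma) U^T
C = XJ @ XJ.T
eigvals = np.linalg.eigvalsh(C)[::-1]        # eigenvalues of C, largest first
```

The squared singular values should coincide with the eigenvalues of C, and the columns of V are the principal directions.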
... for very large dimension d

$C = XJ (XJ)^T \in \mathbb{R}^{d \times d}$
$\tilde{C} = (XJ)^T XJ \in \mathbb{R}^{n \times n}$

$C v = \lambda v$

$\tilde{C} w = (XJ)^T XJ \, (XJ)^T v = \lambda w \quad \text{with } w = (XJ)^T v$
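In practice the small n x n matrix is decomposed and its eigenvectors are mapped back to $\mathbb{R}^d$, which is the converse direction of the identity on this slide: if $w$ is an eigenvector of $(XJ)^T XJ$ with eigenvalue $\lambda > 0$, then $XJ\,w$ is an eigenvector of $C$ with the same eigenvalue. A minimal sketch, with sizes and data chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 1000, 20                              # hypothetical sizes with d >> n
X = rng.normal(size=(d, n))
XJ = X - X.mean(axis=1, keepdims=True)       # centered data

G = XJ.T @ XJ                                # small n x n matrix (XJ)^T XJ
lam, W = np.linalg.eigh(G)                   # ascending eigenvalues, eigenvectors in columns
lam1, w1 = lam[-1], W[:, -1]                 # largest eigenpair of the small matrix

v1 = XJ @ w1                                 # map back to R^d
v1 /= np.linalg.norm(v1)                     # normalise to unit length

C = XJ @ XJ.T                                # d x d matrix, formed here only to check the result
```

Only the last line touches a d x d matrix; the eigen-decomposition itself costs O(n^3) instead of O(d^3).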
Example

10 points in $\mathbb{R}^2$
Multi-Dimensional Scaling

Given: For n unknown samples $X \in \mathbb{R}^{d \times n}$ in high-dimensional space

$X = \{x_1, \dots, x_n\}, \quad x_i \in \mathbb{R}^d$

we are given a matrix $D \in \mathbb{R}^{n \times n}$ of pairwise (squared) distances:

$D_{ij} = \|x_i - x_j\|^2$
Multi-Dimensional Scaling

samples X in some abstract space:

$X = \{x_1, \dots, x_n\}, \quad x_i \in \mathcal{A}$

matrix $D \in \mathbb{R}^{n \times n}$ of pairwise abstract distances







Multi-Dimensional Scaling 



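Goal: find an embedding of X in a low-dimensional
space such that the pairwise (variations of)
distances D are preserved.

$\rho(D, \tilde{D}) = \left\| J^T (D - \tilde{D}) J \right\|_F^2$

(with $\tilde{D}$ the distance matrix of the embedding)

other measures $\rho(D, \tilde{D})$ are possible
but they cannot be solved easily.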
Multi-Dimensional Scaling

closed form solution:

first q eigenvectors $v_1, \dots, v_q$ of the matrix

$-\frac{1}{2} J^T D J \in \mathbb{R}^{n \times n}$

define the coordinates of a q-dimensional
embedding
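The closed-form solution can be sketched as follows. The code assumes the standard classical-MDS recipe: D holds squared distances, the matrix being decomposed is the double-centered $-\frac{1}{2} J D J$, and each eigenvector is scaled by the square root of its eigenvalue to obtain coordinates. The function name `classical_mds` and the test points are illustrative only.

```python
import numpy as np

def classical_mds(D, q):
    """Embed n points in R^q from an n x n matrix D of squared pairwise distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ D @ J                     # double-centered matrix (assumed form, J is symmetric)
    lam, V = np.linalg.eigh(B)               # ascending eigenvalues
    idx = np.argsort(lam)[::-1][:q]          # the q largest eigenvalues
    lam_q = np.clip(lam[idx], 0.0, None)     # guard against tiny negative round-off
    return (V[:, idx] * np.sqrt(lam_q)).T    # q x n coordinate matrix

rng = np.random.default_rng(3)
X = rng.normal(size=(2, 30))                 # hypothetical ground-truth points in R^2
D = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)       # squared Euclidean distances
Y = classical_mds(D, 2)
D_rec = ((Y[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)   # distances of the embedding
```

For Euclidean input distances and q at least the true dimension, the embedding reproduces D exactly, up to rotation, reflection and translation of the points.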
Multi-Dimensional Scaling

[Figure: Principal Components projection of the digits (time 0.00s)]
Motivation

• Given: n samples in d-dimensional space

$X = (x_1, \dots, x_n) \in \mathbb{R}^{d \times n}$

• Decrease n ⇒ clustering:

- k-means
- EM
- Mean shift
- Spectral clustering
- Hierarchical clustering
Cluster Analysis

• Task: Given a set of observations / data samples,
assign them into clusters so that observations in
the same cluster are similar.
k-means Clustering

• Idea: partition n observations into k clusters in
which each observation belongs to the cluster with
the nearest mean.

• Given: data samples $x_1, \dots, x_n$, $x_i \in \mathbb{R}^d$

• Goal: partition the n samples into k sets (k < n)
$S_1, S_2, \dots, S_k$ such that

$\sum_{i=1}^{k} \sum_{x_j \in S_i} \| x_j - \mu_i \|^2$

is minimized, where $\mu_i$ is the mean of points in $S_i$.
k-means Clustering

Two step algorithm:

- Assignment step: Assign each sample to the cluster with
the closest mean:

$S_i^{(t)} = \{ x_j : \| x_j - m_i^{(t)} \| \le \| x_j - m_{i^*}^{(t)} \| \ \ \forall\, i^* = 1, \dots, k \}$

- Update step: Calculate the new means to be the centroid
of the observations in the cluster:

$m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j$

Iterate until convergence (assignments no longer
change)
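The two steps translate almost directly into code. The sketch below is one possible implementation under a few assumptions: samples are stored as rows (n x d), the initial means are passed in explicitly by the caller, and a cluster that becomes empty simply keeps its old mean. The function name `kmeans` and the test data are illustrative.

```python
import numpy as np

def kmeans(X, init_means, n_iter=100):
    """Lloyd's algorithm on the rows of X (shape n x d), started from init_means (k x d)."""
    means = np.array(init_means, dtype=float)       # copy, so the caller's array is untouched
    labels = None
    for _ in range(n_iter):
        # Assignment step: index of the nearest mean for every sample.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                   # assignments no longer change
        labels = new_labels
        # Update step: each mean becomes the centroid of its cluster.
        for i in range(len(means)):
            if np.any(labels == i):                 # an empty cluster keeps its old mean
                means[i] = X[labels == i].mean(axis=0)
    return labels, means

rng = np.random.default_rng(0)
# Hypothetical data: two tight blobs around (0, 0) and (5, 5).
X = np.vstack([rng.normal(0, 0.1, size=(20, 2)), rng.normal(5, 0.1, size=(20, 2))])
labels, means = kmeans(X, X[[0, 20]])               # one initial mean taken from each blob
```

Passing the initial means explicitly makes the dependence on initialization visible: a different choice of `init_means` can lead to a different local optimum.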
k-means Clustering - Comments

• Advantages:

- Efficient
- Always converges to a solution

• Drawbacks:

- Not necessarily globally optimal solution
- #clusters k is an input parameter
- Sensitive to initial clusters
- Cluster model: data is split halfway between cluster
means







Clustering Results 



Different cluster analysis results on "mouse" data set:

[Figure panels: Original Data, k-Means Clustering, EM Clustering]
EM Algorithm



• Expectation Maximization (EM) 

• Probabilistic assignments to clusters instead of 
deterministic assignments 

• Multivariate Gaussian distributions instead of 
means 
EM Algorithm

• Given: data samples $X = \{x_1, \dots, x_n\}$, $x_i \in \mathbb{R}^d$

• Assumption: data was generated by k Gaussians

• Goal: Fit Gaussian mixture model (GMM) to data X
Find (j = 1, ..., k)

- means $\mu_j$
- covariances $\Sigma_j$ of the Gaussians
- probabilities (weights) $\omega_j$ that the samples come from
the Gaussian j
EM Algorithm - Example (1D)

• Three samples drawn from each mixture component

• means: $\mu_1 = -2$, $\mu_2 = 2$
EM Algorithm - Example (2D)

[Figure: Old Faithful geyser, waiting time vs. eruption time (mins)]
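EM Algorithm

Initialization: Choose initial estimates
$\omega_j^0, \mu_j^0, \Sigma_j^0$, j = 1, ..., k and compute the initial
log-likelihood

$L^0 = \frac{1}{n} \sum_{i=1}^{n} \log \left( \sum_{j=1}^{k} \omega_j^0 \, \mathcal{N}(x_i \mid \mu_j^0, \Sigma_j^0) \right)$

E-step: Compute

$\gamma_{ij}^m = \frac{\omega_j^m \, \mathcal{N}(x_i \mid \mu_j^m, \Sigma_j^m)}{\sum_{l=1}^{k} \omega_l^m \, \mathcal{N}(x_i \mid \mu_l^m, \Sigma_l^m)}, \quad i = 1, \dots, n, \ j = 1, \dots, k$

where $\mathcal{N}(x \mid \mu, \Sigma)$ denotes the Gaussian density with mean $\mu$ and covariance $\Sigma$.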










EM Algorithm 




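M-step: Compute new estimates (j = 1, ..., k)

$n_j^m = \sum_{i=1}^{n} \gamma_{ij}^m$

$\omega_j^{m+1} = \frac{n_j^m}{n}$

$\mu_j^{m+1} = \frac{1}{n_j^m} \sum_{i=1}^{n} \gamma_{ij}^m \, x_i$

$\Sigma_j^{m+1} = \frac{1}{n_j^m} \sum_{i=1}^{n} \gamma_{ij}^m \, (x_i - \mu_j^{m+1})(x_i - \mu_j^{m+1})^T$

Convergence check: Compute new
log-likelihood

$L^{m+1} = \frac{1}{n} \sum_{i=1}^{n} \log \left( \sum_{j=1}^{k} \omega_j^{m+1} \, \mathcal{N}(x_i \mid \mu_j^{m+1}, \Sigma_j^{m+1}) \right)$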
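The initialization, E-step, M-step and convergence check can be combined into a short loop. The sketch below follows the standard EM updates for a Gaussian mixture; the function names, the stopping tolerance `tol`, the synthetic data and the starting values are all assumptions made for illustration, and samples are stored as rows (n x d).

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Density of N(mu, Sigma) evaluated at the rows of X (shape n x d)."""
    d = X.shape[1]
    diff = X - mu
    maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(Sigma), diff)
    return np.exp(-0.5 * maha) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

def log_likelihood(X, w, mu, Sigma):
    dens = sum(w[j] * gaussian_pdf(X, mu[j], Sigma[j]) for j in range(len(w)))
    return np.mean(np.log(dens))                 # (1/n) sum_i log sum_j w_j N(x_i | mu_j, Sigma_j)

def em_gmm(X, w, mu, Sigma, n_iter=50, tol=1e-6):
    n, k = len(X), len(w)
    w, mu, Sigma = np.array(w, float), np.array(mu, float), np.array(Sigma, float)
    history = [log_likelihood(X, w, mu, Sigma)]
    for _ in range(n_iter):
        # E-step: responsibility gamma[i, j] of component j for sample i.
        gamma = np.column_stack([w[j] * gaussian_pdf(X, mu[j], Sigma[j]) for j in range(k)])
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and covariances.
        n_j = gamma.sum(axis=0)
        w = n_j / n
        mu = (gamma.T @ X) / n_j[:, None]
        for j in range(k):
            diff = X - mu[j]
            Sigma[j] = (gamma[:, j, None] * diff).T @ diff / n_j[j]
        # Convergence check on the log-likelihood.
        history.append(log_likelihood(X, w, mu, Sigma))
        if history[-1] - history[-2] < tol:
            break
    return w, mu, Sigma, history

rng = np.random.default_rng(0)
# Synthetic data from two Gaussians; parameters chosen for illustration only.
X = np.vstack([
    rng.multivariate_normal([0, 4], [[3, 0], [0, 0.5]], size=600),
    rng.multivariate_normal([-2, 0], [[1, 0], [0, 2]], size=400),
])
w, mu, Sigma, history = em_gmm(X, [0.5, 0.5], [[0, 3], [-1, 0]], [np.eye(2), np.eye(2)])
```

The list `history` records the log-likelihood after every iteration; EM never decreases it, which makes it a convenient sanity check for an implementation.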
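Example (2D)

Ground truth:

Means: $\mu_1 = \begin{pmatrix} 0 \\ 4 \end{pmatrix}, \quad \mu_2 = \begin{pmatrix} -2 \\ 0 \end{pmatrix}$

Covariance matrices: $\Sigma_1 = \begin{pmatrix} 3 & 0 \\ 0 & \frac{1}{2} \end{pmatrix}, \quad \Sigma_2 = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}$

Weights: $w_1 = 0.6, \ w_2 = 0.4$

Input to EM-algorithm:
1000 samples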
Initial Estimate

[Figure: initial estimate, $m = 0$, $L^{(0)} = -3.9750$]
Initial density estimation:

$\mu_1 = \begin{pmatrix} 0.08 \\ 3.92 \end{pmatrix}, \quad \mu_2 = \begin{pmatrix} -2.07 \\ -0.23 \end{pmatrix}$

(centroids of k-means result)

$w_1 = 0.5, \ w_2 = 0.5$

$\Sigma_1 = \Sigma_2 = I_2$

[Figures: 1st Iteration, 2nd Iteration, 3rd Iteration]
Estimates after three iterations:

$\mu_1 = \begin{pmatrix} 0.08 \\ 3.94 \end{pmatrix}, \quad \mu_2 = \begin{pmatrix} -2.02 \\ -0.17 \end{pmatrix}$

$\Sigma_1 = \begin{pmatrix} 2.75 & 0.06 \\ 0.06 & 0.48 \end{pmatrix}, \quad \Sigma_2 = \begin{pmatrix} 0.87 & -0.02 \\ -0.01 & 1.79 \end{pmatrix}$

$w_1 = 0.59, \ w_2 = 0.41$
Mean Shift Clustering



• Non-parametric clustering technique 

• No prior knowledge of #clusters 

• No constraints on shape of clusters 
Mean Shift Clustering - Idea

• Interpret points in feature space as empirical probability
density function

• Dense regions in feature space correspond to local
maxima of the underlying distribution

• For each sample: run gradient ascent procedure on local
estimated density until convergence

• Stationary points = maxima of distribution

• Samples associated with the same stationary point are
considered to be in the same cluster
Mean Shift Clustering

• Given: data samples $x_1, \dots, x_n$, $x_i \in \mathbb{R}^d$

• Multi-variate kernel density estimate with radially
symmetric kernel K(x) and window radius h

$f(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\left( \frac{x - x_i}{h} \right)$

The radially symmetric kernel is defined as

$K(x) = c_{k,d} \, k(\|x\|^2)$

where $c_{k,d}$ is a normalization constant
• Modes of density function are located at zeros of
gradient function $\nabla f(x) = 0$
Mean Shift Clustering

Gradient of density estimator

$\nabla f(x) = \frac{2 c_{k,d}}{n h^{d+2}} \left[ \sum_{i=1}^{n} g\left( \left\| \frac{x - x_i}{h} \right\|^2 \right) \right] \left[ \frac{\sum_{i=1}^{n} x_i \, g\left( \left\| \frac{x - x_i}{h} \right\|^2 \right)}{\sum_{i=1}^{n} g\left( \left\| \frac{x - x_i}{h} \right\|^2 \right)} - x \right]$

where $g(x) = -k'(x)$ denotes the (negative) derivative of the
kernel profile k(x)

- first factor: proportional to density
estimate at x
- second factor: mean shift vector $m_h(x)$

mean shift vector $m_h(x)$ points toward direction of
maximum increase in the density.
Mean Shift Clustering

Mean shift procedure for sample $x_i$:

1. Compute mean shift vector $m(x_i^t)$

2. Translate density estimation window

$x_i^{t+1} = x_i^t + m(x_i^t)$

Iterate 1. and 2. until convergence, i.e.,

$\nabla f(x_i) = 0$
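The procedure can be sketched as below. The code assumes a Gaussian kernel profile $k(s) = e^{-s/2}$, so that $g(s) = -k'(s)$ is proportional to $e^{-s/2}$ and the constant cancels in the mean shift vector. Grouping samples whose modes lie within `merge_radius` of each other is a simple heuristic added for illustration, as are the function names and the test data; samples are stored as rows.

```python
import numpy as np

def mean_shift_mode(x, X, h, tol=1e-6, max_iter=500):
    """Follow the mean shift vector from x until it (numerically) vanishes."""
    x = np.array(x, dtype=float)                 # copy of the starting point
    for _ in range(max_iter):
        # g(||(x - x_i)/h||^2) for the Gaussian profile, up to a constant factor.
        g = np.exp(-0.5 * np.sum((X - x) ** 2, axis=1) / h ** 2)
        m = (g[:, None] * X).sum(axis=0) / g.sum() - x   # mean shift vector m_h(x)
        x += m                                   # translate the window
        if np.linalg.norm(m) < tol:
            break
    return x

def mean_shift(X, h, merge_radius=None):
    """Cluster the rows of X by the mode each sample converges to."""
    merge_radius = h / 2 if merge_radius is None else merge_radius
    modes, labels = [], []
    for x in X:
        mode = mean_shift_mode(x, X, h)
        for j, c in enumerate(modes):
            if np.linalg.norm(mode - c) < merge_radius:  # same stationary point
                labels.append(j)
                break
        else:
            modes.append(mode)                   # first sample that reaches a new mode
            labels.append(len(modes) - 1)
    return np.array(labels), np.array(modes)

rng = np.random.default_rng(0)
# Hypothetical data: two blobs around (0, 0) and (5, 5).
X = np.vstack([rng.normal(0, 0.3, size=(15, 2)), rng.normal(5, 0.3, size=(15, 2))])
labels, modes = mean_shift(X, h=1.0)
```

Every sample runs its own ascent over all n points, which is the quadratic cost per iteration that makes the method expensive on large data sets.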
RWTH Aachen University



Mean Shift - Comments

• Advantages:

- No prior knowledge of #clusters
- No constraints on shape of clusters

• Drawbacks:

- Computationally expensive:

■ Run algorithm for every sample

■ Identification of sample neighborhood requires multi-dimensional
range search

- How to choose the bandwidth parameter h?
Summary

• Given: n samples in d-dimensional space

$X = (x_1, \dots, x_n) \in \mathbb{R}^{d \times n}$

• Decrease d ⇒ dimensionality reduction:

- PCA
- MDS

• Decrease n ⇒ clustering:

- k-means
- EM
- Mean shift
- Spectral clustering
- Hierarchical clustering
Spectral Clustering

• Model similarity between data points as graph

• Clustering: Find connected components in graph
Spectral Clustering

• Model similarity between data points as graph

(weighted) Adjacency Matrix W

Degree Matrix D:

$D = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$
Spectral Clustering

• Graphs:

- Similarity graph: fully connected, model local neighborhood relations

■ Gaussian kernel similarity function: $w_{ij} = e^{-\frac{\|x_i - x_j\|^2}{2\sigma^2}}$

- K-nearest neighbour graph

- ε-neighbourhood graph
Spectral Clustering

• Model similarity between data points as graph

(weighted) Adjacency Matrix W

Degree Matrix D:

$D = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$

Graph Laplacian $L = D - W$
Spectral Clustering

• Properties of the Graph Laplacian L:

- For every vector $f \in \mathbb{R}^n$: $f^T L f = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2$

- L is symmetric and positive semi-definite

- The smallest eigenvalue of L is 0

■ The corresponding eigenvector is the constant one vector $\mathbf{1}$

- L has n non-negative, real-valued eigenvalues $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n$
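These properties are easy to verify numerically on a small similarity graph. The sketch below builds W from a Gaussian kernel on random points (the points, the kernel width `sigma` and the test vector `f` are assumptions for illustration) and evaluates both sides of the quadratic-form identity.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                       # hypothetical points, one per row
sigma = 1.0                                       # assumed kernel width

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
W = np.exp(-sq_dists / (2 * sigma ** 2))          # Gaussian kernel similarities w_ij
np.fill_diagonal(W, 0.0)                          # no self-loops
D = np.diag(W.sum(axis=1))                        # degree matrix
L = D - W                                         # graph Laplacian

f = rng.normal(size=6)                            # arbitrary test vector
quad = f @ L @ f                                  # left-hand side f^T L f
pairwise = 0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2)   # right-hand side
eigvals = np.linalg.eigvalsh(L)                   # ascending eigenvalues of L
```

Because the quadratic form is a sum of non-negative terms, positive semi-definiteness follows directly from the identity.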
Spectral Clustering

• The multiplicity k of the eigenvalue 0 of L equals the number of connected
components in the graph

- Consider k = 1. Assume f is eigenvector with eigenvalue 0:

$0 = f^T L f = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2$

- The sum only vanishes if all terms $w_{ij} (f_i - f_j)^2$ vanish

- If two vertices are connected (their edge weight > 0): $f_i = f_j$

- f needs to be constant for all vertices which can be connected by a path

- All vertices of a connected component in an undirected graph can be connected by a
path:

■ f needs to be constant on the whole connected component
Spectral Clustering

• Laplacian of graph with 1 connected component has one constant vector $\mathbf{1}$
with eigenvalue 0

• For k > 1: W.l.o.g. assume that vertices are ordered according to connected
components

$L = \begin{pmatrix} L_1 & & \\ & \ddots & \\ & & L_k \end{pmatrix}$

• Each $L_i$ is the graph Laplacian of a connected graph (the i-th connected component):

- Each $L_i$ has one eigenvalue 0 with constant one vector on the i-th connected comp.

• Spectrum of L is given by union of the spectra of $L_i$
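The statement about the multiplicity of the eigenvalue 0 can be tried out on a tiny graph. The example graph below (a path on three vertices plus a separate edge) and the numerical threshold `1e-10` are assumptions made for illustration.

```python
import numpy as np

# Hypothetical graph with two components: {0, 1, 2} (a path) and {3, 4} (a single edge).
W = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (3, 4)]:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W                    # block-diagonal graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)              # ascending eigenvalues
k = int(np.sum(eigvals < 1e-10))                  # multiplicity of the eigenvalue 0
embedding = eigvecs[:, :k]                        # row i holds the coordinates of vertex i
```

Every vector in the eigenspace for 0 is constant on each component, so vertices of the same component share one row of `embedding` while vertices of different components get different rows. This is why clustering the rows recovers the components.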
Spectral Clustering

• Graph: [figure]

• Graph Laplacian: [matrix]

• Eigenvectors for eigenvalues: [figure]
Spectral Clustering

• Project vertices into subspace spanned by k eigenvectors

• Projected vertices: [figure]

• K-means clustering recovers the connected components

- Embedding is the same regardless of data ordering
Spectral Clustering

• Similarity Graph: [figure]

• Eigenvalues: 0, 0.4, 2, 2

• Eigenvectors: [figure]
Spectral Clustering

• For fully connected graph we want to find the Min-Cut:

- Partition graph into 2 sets of vertices such that the weight of edges connecting them
is minimal:

■ Vertices in each set should be similar to vertices in the same set, but dissimilar to
vertices from the other set

- Partitions often not balanced: isolated vertices
Spectral Clustering

• For fully connected graph we want to find the Normalized Cut:

- Partition graph into 2 sets of vertices such that the weight of edges connecting them
is minimal

- Partitions should have similar size
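Spectral Clustering

• Min-Cut: minimize $\mathrm{cut}(A, B) = \sum_{i \in A, j \in B} w_{ij}$

• Normalized Cut: minimize $\mathrm{Ncut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{vol}(A)} + \frac{\mathrm{cut}(A, B)}{\mathrm{vol}(B)}$

$\mathrm{vol}(A) = \sum_{i \in A} d_i$

$\frac{1}{\mathrm{vol}(A)} + \frac{1}{\mathrm{vol}(B)}$ minimal if vol(A) = vol(B)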
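Spectral Clustering

• Reformulate with Graph Laplacian: $\mathrm{Ncut}(A, B) = \mathrm{cut}(A, B) \left( \frac{1}{\mathrm{vol}(A)} + \frac{1}{\mathrm{vol}(B)} \right)$

• Construct f:

$f_i = \begin{cases} \sqrt{\mathrm{vol}(B) / \mathrm{vol}(A)} & \text{if } i \in A \\ -\sqrt{\mathrm{vol}(A) / \mathrm{vol}(B)} & \text{if } i \in B \end{cases}$

$(Df)^T \mathbf{1} = \mathrm{vol}(A) \sqrt{\frac{\mathrm{vol}(B)}{\mathrm{vol}(A)}} - \mathrm{vol}(B) \sqrt{\frac{\mathrm{vol}(A)}{\mathrm{vol}(B)}} = 0$

$f^T D f = \mathrm{vol}(V)$

$f^T L f = \mathrm{vol}(V) \, \mathrm{Ncut}(A, B)$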
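The three identities for the indicator vector f can be confirmed on a random weighted graph. The weights, the graph size and the partition below are arbitrary choices for illustration; the variable names are not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 7
W = rng.uniform(0.1, 1.0, size=(n, n))
W = (W + W.T) / 2                                 # symmetric edge weights
np.fill_diagonal(W, 0.0)                          # no self-loops
d = W.sum(axis=1)                                 # vertex degrees d_i
D = np.diag(d)
L = D - W

in_A = np.array([True, True, True, False, False, False, False])   # arbitrary partition A / B
cut = W[in_A][:, ~in_A].sum()                     # total weight of edges between A and B
vol_A, vol_B = d[in_A].sum(), d[~in_A].sum()
ncut = cut / vol_A + cut / vol_B

# Indicator vector: sqrt(vol(B)/vol(A)) on A and -sqrt(vol(A)/vol(B)) on B.
f = np.where(in_A, np.sqrt(vol_B / vol_A), -np.sqrt(vol_A / vol_B))
```

With these definitions, `d @ f` is zero, `f @ D @ f` equals vol(V) = vol(A) + vol(B), and `f @ L @ f` equals vol(V) times the normalized cut.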
Spectral Clustering

• Reformulate Ncut: $\frac{f^T L f}{f^T D f} = \frac{\mathrm{vol}(V) \, \mathrm{Ncut}(A, B)}{\mathrm{vol}(V)} = \mathrm{Ncut}(A, B)$

• Minimize $\frac{f^T L f}{f^T D f}$ subject to $Df \perp \mathbf{1}$

- Partition (cluster) assignment by thresholding f at 0

- NP hard to compute since f is discrete

- Relax problem by allowing f to take arbitrary real values

■ Solution: second eigenvector of $L' = D^{-1} L$ (normalized Graph Laplacian)

• For k > 2 we can similarly construct indicator vectors like f and relax the
problem for minimization:

- Project the vertices into the subspace spanned by the first k eigenvectors of $L'$

- Clustering the embedded vertices yields the solution

• Spectral clustering (with normalized Graph Laplacian) approximates Ncut
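For two clusters, the relaxed problem reduces to one eigenvector computation followed by a threshold. The sketch below obtains the eigenvectors of $D^{-1} L$ through the symmetric matrix $D^{-1/2} L D^{-1/2}$, which has the same eigenvalues; this detour is an implementation choice made here so that `np.linalg.eigh` can be used. The function name, the kernel width and the test data are assumptions.

```python
import numpy as np

def ncut_bipartition(W):
    """Relaxed normalized cut: threshold the second eigenvector of D^-1 L at 0."""
    d = W.sum(axis=1)                             # vertex degrees (assumed strictly positive)
    L = np.diag(d) - W
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # D^-1/2 L D^-1/2 is symmetric and has the same eigenvalues as D^-1 L.
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    eigvals, U = np.linalg.eigh(L_sym)            # ascending eigenvalues
    f = d_inv_sqrt * U[:, 1]                      # eigenvector of D^-1 L for the 2nd smallest eigenvalue
    return f > 0, f, eigvals[1]

rng = np.random.default_rng(0)
# Hypothetical data: two blobs around (0, 0) and (3, 3), samples as rows.
X = np.vstack([rng.normal(0, 0.5, size=(10, 2)), rng.normal(3, 0.5, size=(10, 2))])
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
W = np.exp(-sq / 2.0)                             # Gaussian kernel with sigma = 1 (assumed)
np.fill_diagonal(W, 0.0)
labels, f, lam2 = ncut_bipartition(W)
```

For k > 2 the same function could return the first k columns of the rescaled eigenvector matrix, whose rows would then be clustered, for example with k-means.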








Spectral Clustering 



Mean Shift 
Spectral Clustering

• Summary:

- Useful for non-convex clustering problems

- Computation intensive because of eigenvalue computation (for large matrices)

- Choice of k necessary:

■ A heuristic can be used that tries to find jumps in the eigenvalues (eigengap)

- Similarity has to be defined for graph construction:

■ Size of Gaussian kernel?

■ Size of neighbourhood?
Hierarchical Clustering

• Bottom up:

- Each data point is its own cluster
- Greedily merge clusters according to some criterion
Hierarchical Clustering

• Requirements:

- Metric: distance between data points d(x, y)

- Linkage: distance between data point sets:

■ Maximum linkage: $l(A, B) = \max \{ d(x, y) : x \in A, y \in B \}$

■ Average linkage: $l(A, B) = \frac{1}{|A| |B|} \sum_{x \in A} \sum_{y \in B} d(x, y)$

■ Ward linkage: $l(A, B) = \sum_{i \in A \cup B} \| x_i - m_{A \cup B} \|^2 - \sum_{i \in A} \| x_i - m_A \|^2 - \sum_{i \in B} \| x_i - m_B \|^2$
Hierarchical Clustering

• Algorithm:

- Start out with a cluster for each data point
- Merge the two clusters that result in the least increase in the linkage criterion
- Repeat until k clusters remain

• Maximum linkage:

- Minimizes maximal distance of data points in each cluster

• Average linkage:

- Minimizes average distance of data points in each cluster

• Ward linkage:

- Minimizes within-cluster variance
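A naive version of the bottom-up algorithm fits in a few lines. The sketch assumes the Euclidean metric, merges in each round the pair of clusters with the smallest linkage value, and recomputes all linkages from scratch, which is simple but far slower than a production implementation. Function names and test data are illustrative; samples are stored as rows.

```python
import numpy as np

def max_linkage(A, B, dist):
    """Largest distance between a point of A and a point of B."""
    return max(dist[i, j] for i in A for j in B)

def average_linkage(A, B, dist):
    """Mean distance over all pairs (x, y) with x in A and y in B."""
    return sum(dist[i, j] for i in A for j in B) / (len(A) * len(B))

def agglomerative(X, k, linkage=max_linkage):
    """Greedy bottom-up clustering of the rows of X until k clusters remain."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # Euclidean metric d(x, y)
    clusters = [[i] for i in range(len(X))]       # every data point starts as its own cluster
    while len(clusters) > k:
        # Find the pair of clusters with the smallest linkage value.
        pairs = [(a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))]
        a, b = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]], dist))
        clusters[a] += clusters[b]                # merge cluster b into cluster a
        del clusters[b]
    return clusters

rng = np.random.default_rng(0)
# Hypothetical data: two blobs around (0, 0) and (3, 3).
X = np.vstack([rng.normal(0, 0.2, size=(5, 2)), rng.normal(3, 0.2, size=(5, 2))])
clusters = agglomerative(X, 2)
clusters_avg = agglomerative(X, 2, linkage=average_linkage)
```

Swapping the `linkage` argument changes the merge criterion without touching the loop, which reflects the flexibility of the approach: any pairwise distance and any set-to-set linkage can be plugged in.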
Hierarchical Clustering



• We can add connectivity constraints that enforce which clusters can be 
merged 



Without connectivity constraints (time 0.79s) 



With connectivity constraints (time 0.16s) 
Hierarchical Clustering

• Summary:

- Flexible: any pairwise distance can be used

- Choice of k, distance and linkage necessary

- Instead of specifying k we can use a heuristic which stops cluster merging if the
linkage increases too much

- Given connectivity constraints hierarchical clustering scales well for large number of
data points

- How do we choose connectivity constraints?

■ K-nearest neighbour graph

■ ε-neighbourhood graph